Cross-Modal Hierarchical Modelling for Fine-Grained Sketch Based Image Retrieval
Sketch as an image search query is an ideal alternative to text in capturing
the fine-grained visual details. Prior successes on fine-grained sketch-based
image retrieval (FG-SBIR) have demonstrated the importance of tackling the
unique traits of sketches as opposed to photos, e.g., temporal vs. static,
strokes vs. pixels, and abstract vs. pixel-perfect. In this paper, we study a
further trait of sketches that has been overlooked to date, that is, they are
hierarchical in terms of the levels of detail -- a person typically sketches up
to various extents of detail to depict an object. This hierarchical structure
is often visually distinct. In this paper, we design a novel network that is
capable of cultivating sketch-specific hierarchies and exploiting them to match
sketch with photo at corresponding hierarchical levels. In particular, features
from a sketch and a photo are enriched using cross-modal co-attention, coupled
with hierarchical node fusion at every level to form a better embedding space
to conduct retrieval. Experiments on common benchmarks show our method outperforms the state of the art by a significant margin.
Comment: Accepted for ORAL presentation in BMVC 2020
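To make the cross-modal co-attention step concrete, below is a minimal PyTorch sketch of how tokens from one modality could be enriched with attended context from the other. The module name, single-head residual design, and dimensions are illustrative assumptions, not the paper's actual architecture (which additionally performs hierarchical node fusion at every level).

```python
import torch
import torch.nn as nn

class CrossModalCoAttention(nn.Module):
    """Illustrative co-attention: tokens of one modality attend over the
    other's tokens, and the attended context enriches the original features.
    A single-head, residual design is assumed for brevity."""

    def __init__(self, dim=512):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def attend(self, x, context):
        # x: (B, Nx, D) tokens of one modality; context: (B, Nc, D) of the other
        q, k, v = self.query(x), self.key(context), self.value(context)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        return x + attn @ v  # residual enrichment with cross-modal context

    def forward(self, sketch_tokens, photo_tokens):
        # Each modality is enriched with context attended from the other
        return (self.attend(sketch_tokens, photo_tokens),
                self.attend(photo_tokens, sketch_tokens))
```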
CLIP for All Things Zero-Shot Sketch-Based Image Retrieval, Fine-Grained or Not
In this paper, we leverage CLIP for zero-shot sketch based image retrieval
(ZS-SBIR). We are largely inspired by recent advances in foundation models and the unparalleled generalisation ability they seem to offer, but for the first time tailor such a model to benefit the sketch community. We put forward novel designs on
how best to achieve this synergy, for both the category setting and the
fine-grained setting ("all"). At the very core of our solution is a prompt
learning setup. First, we show that just by factoring in sketch-specific prompts, we already obtain a category-level ZS-SBIR system that surpasses all prior art by a large margin (24.8%), a great testimony to the value of studying the CLIP and ZS-SBIR synergy. Moving on to the fine-grained setup is however trickier, and requires a
deeper dive into this synergy. For that, we come up with two specific designs
to tackle the fine-grained matching nature of the problem: (i) an additional
regularisation loss to ensure the relative separation between sketches and
photos is uniform across categories, which is not the case for the gold
standard standalone triplet loss, and (ii) a clever patch shuffling technique to help establish instance-level structural correspondences between
sketch-photo pairs. With these designs, we again observe significant
performance gains in the region of 26.9% over the previous state of the art. The take-home message, if any, is that the proposed CLIP and prompt-learning paradigm carries great promise for tackling other sketch-related tasks (not limited to ZS-SBIR) where data scarcity remains a great challenge. Project page: https://aneeshan95.github.io/Sketch_LVM/
Comment: Accepted in CVPR 2023. Project page available at https://aneeshan95.github.io/Sketch_LVM
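As a concrete illustration of design (ii), here is a minimal PyTorch sketch of patch shuffling: split an image into a grid and permute the patches, applying the same permutation to a sketch and its paired photo so that matching must rely on local structural correspondence. The grid size, the helper name shuffle_patches, and the assumption that image sides divide evenly by the grid are illustrative, not the paper's exact recipe.

```python
import torch

def shuffle_patches(images, num_patches=3, perm=None):
    """Split each image in a batch into a num_patches x num_patches grid and
    permute the patches. Passing the same `perm` to a sketch and its paired
    photo shuffles both identically. Assumes H and W divide evenly by the grid."""
    B, C, H, W = images.shape
    ph, pw = H // num_patches, W // num_patches
    # Carve out the grid: (B, C, n, n, ph, pw) -> (B, C, n*n, ph, pw)
    patches = images.unfold(2, ph, ph).unfold(3, pw, pw)
    patches = patches.contiguous().view(B, C, -1, ph, pw)
    if perm is None:
        perm = torch.randperm(patches.shape[2])
    patches = patches[:, :, perm]  # permute the patch order
    # Stitch the permuted grid back into a full image
    patches = patches.view(B, C, num_patches, num_patches, ph, pw)
    shuffled = patches.permute(0, 1, 2, 4, 3, 5).contiguous().view(B, C, H, W)
    return shuffled, perm

# Usage: shuffle a sketch-photo pair with the same permutation
# sketch_shuf, perm = shuffle_patches(sketch_batch)
# photo_shuf, _ = shuffle_patches(photo_batch, perm=perm)
```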
What Can Human Sketches Do for Object Detection?
Sketches are highly expressive, inherently capturing subjective and
fine-grained visual cues. The exploration of such innate properties of human
sketches has, however, been limited to that of image retrieval. In this paper,
for the first time, we cultivate the expressiveness of sketches but for the
fundamental vision task of object detection. The end result is a sketch-enabled
object detection framework that detects based on what you sketch: that "zebra" (e.g., one that is eating the grass) in a herd of zebras (instance-aware detection), and only the part (e.g., the "head" of a "zebra") that you desire (part-aware detection). We further require that our model works without (i) knowing which category to expect at testing (zero-shot), and (ii) requiring additional bounding boxes (as in fully supervised detection) or class labels (as in weakly supervised detection). Instead of devising a model from the
ground up, we show an intuitive synergy between foundation models (e.g., CLIP)
and existing sketch models built for sketch-based image retrieval (SBIR), which can already elegantly solve the task: CLIP provides model generalisation, and SBIR bridges the (sketch→photo) gap. In particular, we first
perform independent prompting on both sketch and photo branches of an SBIR
model to build highly generalisable sketch and photo encoders on the back of
the generalisation ability of CLIP. We then devise a training paradigm to adapt
the learned encoders for object detection, such that the region embeddings of
detected boxes are aligned with the sketch and photo embeddings from SBIR.
Evaluated on standard object detection datasets such as PASCAL-VOC and MS-COCO, our framework outperforms both supervised (SOD) and weakly supervised (WSOD) object detectors in zero-shot setups. Project page: https://pinakinathc.github.io/sketch-detect
Comment: Accepted as a Top 12 Best Paper; to be presented in special single-track plenary sessions to all attendees at Computer Vision and Pattern Recognition (CVPR), 2023. Project page: www.pinakinathc.me/sketch-detec
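One way to picture the alignment objective at inference time: score each detected box's region embedding against the query sketch embedding in the shared space. The sketch below assumes CLIP-style cosine similarity with a softmax temperature; the function name and temperature value are illustrative, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def score_regions_with_sketch(region_embeds, sketch_embed, temperature=0.07):
    """Cosine similarity between detected-box region embeddings (num_boxes, D)
    and a single sketch query embedding (D,) in the shared space, softened
    into per-box matching probabilities."""
    region_embeds = F.normalize(region_embeds, dim=-1)
    sketch_embed = F.normalize(sketch_embed, dim=-1)
    logits = region_embeds @ sketch_embed / temperature  # (num_boxes,)
    return logits.softmax(dim=0)
```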
Exploiting Unlabelled Photos for Stronger Fine-Grained SBIR
This paper advances the fine-grained sketch-based image retrieval (FG-SBIR)
literature by putting forward a strong baseline that surpasses the prior state of the art by ~11%. This is achieved not via complicated design, but by addressing two critical issues facing the community: (i) the gold-standard
triplet loss does not enforce holistic latent space geometry, and (ii) there
are never enough sketches to train a high accuracy model. For the former, we
propose a simple modification to the standard triplet loss that explicitly enforces separation amongst photo/sketch instances. For the latter, we put forward a novel knowledge distillation module that can leverage photo data for model training. Both modules are then plugged into a plug-and-play training paradigm that allows for more stable training. More specifically, for (i) we
employ an intra-modal triplet loss amongst sketches to pull sketches of the same instance closer to each other than to others, and one more amongst photos to push away different photo instances while pulling closer a structurally augmented version of the same photo (offering a gain of ~4-6%). To tackle (ii), we first
pre-train a teacher on the large set of unlabelled photos using the
aforementioned intra-modal photo triplet loss. Then we distill the contextual
similarity present amongst the instances in the teacher's embedding space to
that in the student's embedding space, by matching the distribution over
inter-feature distances of respective samples in both embedding spaces
(delivering a further gain of ~4-5%). Apart from significantly outperforming prior art, our model also yields satisfactory results in generalising to new classes. Project page: https://aneeshan95.github.io/Sketch_PVT/
Comment: Accepted in CVPR 2023. Project page available at https://aneeshan95.github.io/Sketch_PVT
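To make the distillation step concrete, here is a minimal PyTorch sketch of matching the student's distribution over inter-feature distances to the teacher's within a batch. The Euclidean metric, softmax temperature, and KL objective are plausible assumptions, not necessarily the paper's exact choices.

```python
import torch
import torch.nn.functional as F

def contextual_similarity_loss(student_feats, teacher_feats, tau=1.0):
    """Distill contextual similarity: per sample, turn pairwise distances in a
    batch into a neighbour distribution, then match student to teacher."""
    def log_neighbour_distribution(feats):
        d = torch.cdist(feats, feats)  # (B, B) pairwise Euclidean distances
        d.fill_diagonal_(1e9)          # exclude self-pairs from the distribution
        return F.log_softmax(-d / tau, dim=-1)  # closer samples -> higher mass

    log_p_teacher = log_neighbour_distribution(teacher_feats)
    log_p_student = log_neighbour_distribution(student_feats)
    return F.kl_div(log_p_student, log_p_teacher,
                    reduction="batchmean", log_target=True)
```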
PQA: Perceptual Question Answering
Perceptual organization remains one of the very few established theories of the human visual system. It underpinned many pre-deep-learning seminal works on segmentation and detection, yet research has seen a rapid decline since the preferential shift to learning deep models. Of the limited attempts, most aimed at interpreting complex visual scenes using perceptual organizational rules. This has, however, proven sub-optimal, since models were unable to effectively capture the visual complexity of real-world imagery. In this paper, we rejuvenate the study of perceptual organization by advocating two positional changes: (i) we examine purposefully generated synthetic data, instead of complex real imagery, and (ii) we ask machines to synthesize novel perceptually-valid patterns, instead of explaining existing data. Our overall answer lies with the introduction of a novel visual challenge: perceptual question answering (PQA). Upon observing example perceptual question-answer pairs, the goal for PQA is to solve similar questions by generating answers entirely from scratch. Our first contribution is therefore the first dataset of perceptual question-answer pairs, each generated specifically for a particular Gestalt principle. We then borrow insights from human psychology to design an agent that casts perceptual organization as a self-attention problem, where a proposed grid-to-grid mapping network directly generates answer patterns from scratch. Experiments show our agent to outperform a selection of naive and strong baselines. A human study, however, indicates that ours uses astronomically more data to learn compared to an average human, necessitating future research (with or without our dataset).
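For intuition, below is a minimal PyTorch sketch of a grid-to-grid mapping network that casts perceptual organization as self-attention over grid cells. The symbol vocabulary, transformer depth, and learned positional embedding are illustrative assumptions, not the agent's actual design.

```python
import torch
import torch.nn as nn

class GridToGridMapper(nn.Module):
    """Illustrative grid-to-grid mapper: embed each input-grid cell as a token,
    let cells attend to one another (a stand-in for perceptual grouping), and
    decode a per-cell distribution over output symbols."""

    def __init__(self, num_symbols=16, dim=128, heads=4, depth=2, max_cells=1024):
        super().__init__()
        self.embed = nn.Embedding(num_symbols, dim)
        self.pos = nn.Parameter(torch.zeros(1, max_cells, dim))  # learned positions
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.decode = nn.Linear(dim, num_symbols)

    def forward(self, grid):  # grid: (B, H, W) integer symbol ids
        B, H, W = grid.shape
        tokens = self.embed(grid.view(B, H * W)) + self.pos[:, : H * W]
        logits = self.decode(self.encoder(tokens))  # (B, H*W, num_symbols)
        return logits.view(B, H, W, -1)  # per-cell answer distribution
```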